feat(spark): implement substring function#19805
Conversation
| @@ -0,0 +1,207 @@ | |||
| // Licensed to the Apache Software Foundation (ASF) under one | |||
There was a problem hiding this comment.
copied the same bench as https://github.com/apache/datafusion/blob/main/datafusion/functions/src/unicode/substr.rs
There was a problem hiding this comment.
Here is a comparison of the perf between the DF built-in and the new spark substring expressions. The perf is comparable
| Benchmark | Spark substring Time | DF substring Time |
|---|---|---|
SHORTER THAN 12/substr_string_view [size=1024, strlen=12] |
[9.9232 µs 10.026 µs 10.152 µs] |
[7.2600 µs 7.2906 µs 7.3209 µs] |
SHORTER THAN 12/substr_string [size=1024, strlen=12] |
[7.6418 µs 7.7211 µs 7.8093 µs] |
[8.4374 µs 9.0174 µs 10.083 µs] |
SHORTER THAN 12/substr_large_string [size=1024, strlen=12] |
[8.1936 µs 8.2795 µs 8.3950 µs] |
[8.8816 µs 8.9169 µs 8.9539 µs] |
LONGER THAN 12/substr_string_view [size=1024, count=64, strlen=128] |
[14.973 µs 15.169 µs 15.429 µs] |
[12.647 µs 12.770 µs 12.904 µs] |
LONGER THAN 12/substr_string [size=1024, count=64, strlen=128] |
[12.132 µs 12.183 µs 12.233 µs] |
[13.968 µs 14.022 µs 14.074 µs] |
LONGER THAN 12/substr_large_string [size=1024, count=64, strlen=128] |
[12.259 µs 12.317 µs 12.379 µs] |
[13.995 µs 14.109 µs 14.280 µs] |
SRC_LEN > 12, SUB_LEN < 12/substr_string_view [size=1024, count=6, strlen=128] |
[22.497 µs 22.631 µs 22.793 µs] |
[14.784 µs 14.901 µs 15.075 µs] |
SRC_LEN > 12, SUB_LEN < 12/substr_string [size=1024, count=6, strlen=128] |
[20.755 µs 20.850 µs 20.978 µs] |
[15.804 µs 15.861 µs 15.914 µs] |
SRC_LEN > 12, SUB_LEN < 12/substr_large_string [size=1024, count=6, strlen=128] |
[20.794 µs 21.045 µs 21.459 µs] |
[15.771 µs 15.816 µs 15.860 µs] |
SHORTER THAN 12/substr_string_view [size=4096, strlen=12] |
[38.334 µs 38.569 µs 38.934 µs] |
[26.812 µs 27.032 µs 27.307 µs] |
SHORTER THAN 12/substr_string [size=4096, strlen=12] |
[27.456 µs 27.548 µs 27.640 µs] |
[32.104 µs 32.449 µs 32.802 µs] |
SHORTER THAN 12/substr_large_string [size=4096, strlen=12] |
[30.323 µs 30.777 µs 31.286 µs] |
[34.284 µs 34.514 µs 34.858 µs] |
LONGER THAN 12/substr_string_view [size=4096, count=64, strlen=128] |
[55.663 µs 56.315 µs 57.106 µs] |
[47.091 µs 47.321 µs 47.639 µs] |
LONGER THAN 12/substr_string [size=4096, count=64, strlen=128] |
[40.263 µs 40.768 µs 41.399 µs] |
[51.885 µs 52.049 µs 52.253 µs] |
LONGER THAN 12/substr_large_string [size=4096, count=64, strlen=128] |
[43.083 µs 43.261 µs 43.452 µs] |
[51.839 µs 52.048 µs 52.348 µs] |
SRC_LEN > 12, SUB_LEN < 12/substr_string_view [size=4096, count=6, strlen=128] |
[84.649 µs 85.064 µs 85.500 µs] |
[55.237 µs 55.607 µs 56.018 µs] |
SRC_LEN > 12, SUB_LEN < 12/substr_string [size=4096, count=6, strlen=128] |
[76.146 µs 77.393 µs 79.422 µs] |
[60.728 µs 60.998 µs 61.235 µs] |
SRC_LEN > 12, SUB_LEN < 12/substr_large_string [size=4096, count=6, strlen=128] |
[76.974 µs 77.166 µs 77.365 µs] |
[60.319 µs 60.630 µs 60.976 µs] |
4f1fbc2 to
5cd1ed3
Compare
| make_udf_function!(substring::SparkSubstring, substring); | ||
| make_udf_function!(substring::SparkSubstring, substr); |
There was a problem hiding this comment.
thats when I was debugging why the SLT tests kept falling back to the other implementation 😅 when I realized that I needed the custom expr planner it fixed things. will remove the dup
ca1ee8a to
d9cfca8
Compare
| if start >= 0 { | ||
| start.max(1) | ||
| } else { | ||
| let start = start + len as i64 + 1; |
There was a problem hiding this comment.
Imaginary, but possible input:
fn main() {
// input
let start = -1_i64;
let len = 1_u64 + i64::MAX as u64;
// the tested logic
let start = start + len as i64 + 1;
println!("len: {len}\nstart: {start}");
}fails with:
thread 'main' (40) panicked at src/main.rs:6:17:
attempt to add with overflow
There was a problem hiding this comment.
added some guardrails against overflows
| start_array: &Int64Array, | ||
| length_array: Option<&Int64Array>, | ||
| ) -> Result<ArrayRef> { | ||
| let mut builder = StringViewBuilder::new(); |
There was a problem hiding this comment.
The body of this function is very similar to spark_substring_utf8().
Could it be extracted and reused ?
AFAIS only the builder type is different.
There was a problem hiding this comment.
good catch, updated to avoid the duplicate code
Which issue does this PR close?
Part of [EPIC] Complete
datafusion-sparkSpark Compatible Functions #15914Closes [datafusion-spark]: Implement substring function #19803.
Rationale for this change
What changes are included in this PR?
Implementation of spark substring function.
Spark implementation:
Are these changes tested?
yes
Are there any user-facing changes?